{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multiclass Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example show shows how to use `tsfresh` to extract and select useful features from timeseries in a multiclass classification example.\n",
    "\n",
    "We use an example dataset of human activity recognition for this.\n",
    "The dataset consists of timeseries for 7352 accelerometer readings. \n",
    "Each reading represents an accelerometer reading for 2.56 sec at 50hz (for a total of 128 samples per reading). Furthermore, each reading corresponds one of six activities (walking, walking upstairs, walking downstairs, sitting, standing and laying).\n",
    "\n",
    "For more information go to https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones\n",
    "\n",
    "This notebook follows the example in [the first notebook](./01%20Feature%20Extraction%20and%20Selection.ipynb), so we will go quickly over the extraction and focus on the more interesting feature selection in this case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pylab as plt\n",
    "\n",
    "from tsfresh import extract_features, extract_relevant_features, select_features\n",
    "from tsfresh.utilities.dataframe_functions import impute\n",
    "\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and visualize data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset, load_har_classes\n",
    "\n",
    "# fetch dataset from uci\n",
    "download_har_dataset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = load_har_dataset()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = load_har_classes()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data is not in a typical time series format so far: \n",
    "the columns are the time steps whereas each row is a measurement of a different person.\n",
    "\n",
    "Therefore we bring it to a format where the time series of different persons are identified by an `id` and are order by time vertically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"id\"] = df.index\n",
    "df = df.melt(id_vars=\"id\", var_name=\"time\").sort_values([\"id\", \"time\"]).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.title('accelerometer reading')\n",
    "plt.plot(df[df[\"id\"] == 0].set_index(\"time\").value)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extract Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only use the first 500 ids to speed up the processing\n",
    "X = extract_features(df[df[\"id\"] < 500], column_id=\"id\", column_sort=\"time\", impute_function=impute)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Train and evaluate classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For later comparison, we train a decision tree on all features (without selection):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y[:500], test_size=.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "classifier_full = DecisionTreeClassifier()\n",
    "classifier_full.fit(X_train, y_train)\n",
    "print(classification_report(y_test, classifier_full.predict(X_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multiclass feature selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now select a subset of relevant features using the `tsfresh` select features method.\n",
    "However it only works for binary classification or regression tasks. \n",
    "\n",
    "For a 6 label multi classification we therefore split the selection problem into 6 binary one-versus all classification problems. \n",
    "For each of them we can do a binary classification feature selection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "relevant_features = set()\n",
    "\n",
    "for label in y.unique():\n",
    "    y_train_binary = y_train == label\n",
    "    X_train_filtered = select_features(X_train, y_train_binary)\n",
    "    print(\"Number of relevant features for class {}: {}/{}\".format(label, X_train_filtered.shape[1], X_train.shape[1]))\n",
    "    relevant_features = relevant_features.union(set(X_train_filtered.columns))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(relevant_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we keep only those features that we selected above, for both the train and test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_filtered = X_train[list(relevant_features)]\n",
    "X_test_filtered = X_test[list(relevant_features)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and train again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifier_selected = DecisionTreeClassifier()\n",
    "classifier_selected.fit(X_train_filtered, y_train)\n",
    "print(classification_report(y_test, classifier_selected.predict(X_test_filtered)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It worked! The precision improved by removing irrelevant features."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
